Part-of-Speech Tagging with Two Sequential Transducers 



Andre Kempe 
Xerox Research Centre Europe 

Abstract 

The article presents a method of constructing and applying a cascade consisting of a left- and a 
right-sequential finite-state transducer, T1 and T2, for part-of-speech disambiguation. In the process
of POS tagging, every word is first assigned a unique ambiguity class that represents the set of 
alternative tags that this word can occur with. The sequence of the ambiguity classes of all words 
of one sentence is then mapped by T1 to a sequence of reduced ambiguity classes where some of
the less likely tags are removed. That sequence is finally mapped by T2 to a sequence of single 
tags. Compared to a Hidden Markov model tagger, this transducer cascade has the advantage of 
significantly higher processing speed, but at the cost of slightly lower accuracy. Applications such 
as Information Retrieval, where the speed can be more important than accuracy, could benefit from 
this approach. 

1 Introduction 

We present a method of constructing and applying a cascade consisting of a left- and a right-sequential 
finite-state transducer (FST), T1 and T2, for part-of-speech (POS) disambiguation.

In the process of POS tagging, we first assign every word of a sentence a unique ambiguity class $c_i$ that
can be looked up in a lexicon encoded by a sequential FST. Every $c_i$ is denoted by a single symbol, e.g.,
"[ADJ NOUN]", although it represents a set of alternative tags that a given word can occur with. The sequence of
the $c_i$ of all words of one sentence is the input to our FST cascade (Fig. 1). It is mapped by T1, from left to right,
to a sequence of reduced ambiguity classes $r_i$. Every $r_i$ is denoted by a single symbol, although it represents a
set of alternative tags. Intuitively, T1 eliminates the less likely tags from $c_i$, thus creating $r_i$. Finally, T2 maps
the sequence of $r_i$, from right to left, to an output sequence of single POS tags $t_i$. Intuitively, T2 selects the most
likely $t_i$ from every $r_i$ (Fig. 1).

. . . [DET RELPRO]   [ADJ NOUN]   [ADJ NOUN VERB]   [VERB] . . .

      ==> mapping left to right ==>

. . . [DET RELPRO]   [ADJ]        [ADJ NOUN]        [VERB] . . .

      <== mapping right to left <==

      DET            ADJ          NOUN              VERB

Figure 1: Part of an input, an intermediate, and an output sequence in the FST cascade (example) 



Compared to a Hidden Markov model (HMM) (Rabiner 1990), this FST cascade has the advantage of significantly
higher processing speed, but at the cost of slightly lower accuracy. Applications such as Information
Retrieval, where speed can be more important than accuracy, could benefit from this approach.



Although our approach is related to the concept of bimachines (Schützenberger 1961) and factorization (Elgot
and Mezei 1965), we proceed differently in that we build two sequential FSTs directly and not by factorization.



This article is structured as follows. Section 2 describes how the ambiguity classes and reduced ambiguity
classes are defined based on a lexicon and a training corpus. Then, Section 3 explains how the probabilities of
these classes in the context of other classes are calculated. The construction of T1 and T2 is shown in Section 4.
It makes use of the previously defined classes and their probabilities. Section 5 describes the application of the
FSTs to an input text, and Section 6 finally compares the FSTs to an HMM tagger, based on experimental data.



2 Definition of Classes 



Instead of dealing with lexical probabilities of individual words (Church 1988), many POS taggers group words
into ambiguity classes and deal with lexical probabilities of these classes (Cutting, Kupiec, Pedersen and Sibun
1992, Kupiec 1992). Every word belongs to one ambiguity class that is described by the set of all POS tags
that the word can occur with. For example, the class described by {NOUN, VERB} includes all words that could be
analyzed either as noun or verb depending on the context. We follow this approach.



Some approaches make a more fine-grained word classification (Daelemans, Zavrel, Berck and Gillis 1996,
Tzoukermann and Radev 1996). Words that occur with the same alternative tags, e.g., NOUN and VERB, can here
be assigned different ambiguity classes depending on whether they occur more frequently with one or with the
other tag. Although this has proven to increase the accuracy of HMM-based POS disambiguation, it did not
significantly improve our method. After some investigations in this direction, we decided to follow the simpler
classification above.



Before we can build the FST cascade, we have to define ambiguity classes, which will constitute the input
alphabet of T1, and reduced ambiguity classes, which will form the intermediate alphabet of the cascade, i.e., the
output of T1 and the input of T2.

Ambiguity classes $c_i$ are defined from the training corpus and lexicon, and are each described by a pair
consisting of a tag list $t(c_i)$ and a probability vector $p(c_i)$:

$$
t(c_i) = (t_{i1}, t_{i2}, \ldots, t_{in}), \qquad
p(c_i) = \begin{bmatrix} p(t_{i1} \mid c_i) \\ p(t_{i2} \mid c_i) \\ \vdots \\ p(t_{in} \mid c_i) \end{bmatrix}
\tag{1}
$$

For example:

$$
t(c_1) = (\text{ADJ}, \text{NOUN}, \text{VERB}), \qquad
p(c_1) = \begin{bmatrix} 0.29 \\ 0.60 \\ 0.11 \end{bmatrix}
\tag{2}
$$

which means that the words that belong to $c_1$ are tagged as ADJ in 29 %, as NOUN in 60 %, and as VERB in 11 %
of all cases in the training corpus.
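
The estimation of $p(c_i)$ can be illustrated by a minimal Python sketch. It assumes a tagged corpus given as (word, tag) pairs and a tag-based lexicon given as a dictionary; the function name `estimate_class_vectors` and the toy data are hypothetical and not taken from our implementation.

```python
from collections import Counter, defaultdict

def estimate_class_vectors(tagged_corpus, lexicon):
    """Estimate p(c_i) for every ambiguity class by relative frequency.

    tagged_corpus: iterable of (word, tag) pairs (assumed format).
    lexicon: dict mapping a word to the tuple of tags it can occur with.
    Returns a dict mapping each tag list t(c_i) to its probability vector.
    """
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[lexicon[word]][tag] += 1  # count tags per ambiguity class
    vectors = {}
    for tag_list, tag_counts in counts.items():
        total = sum(tag_counts[t] for t in tag_list)
        vectors[tag_list] = [tag_counts[t] / total for t in tag_list]
    return vectors

# Toy data (hypothetical): two words share the class (ADJ, NOUN, VERB).
lexicon = {
    "light": ("ADJ", "NOUN", "VERB"),
    "round": ("ADJ", "NOUN", "VERB"),
    "the": ("DET",),
}
corpus = [("light", "ADJ"), ("light", "NOUN"),
          ("round", "NOUN"), ("round", "VERB"), ("the", "DET")]
vectors = estimate_class_vectors(corpus, lexicon)
```

The sketch assumes that every corpus tag of a word is also listed for that word in the lexicon.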

When all $c_i$ are defined, a class-based lexicon, which maps every word to a single class symbol, is constructed
from the original tag-based lexicon, which maps every word to a set of alternative tag symbols. In the class-based
lexicon, the above $c_1$ (Eq. 2) could be represented, e.g., by the symbol "[ADJ NOUN VERB]".

We also describe a reduced ambiguity class $r_i$ by a pair consisting of a tag list $t(r_i)$ and a probability vector
$p(r_i)$. Intuitively, an $r_i$ can be seen as a $c_i$ where some of the less likely tags have been removed. Since at this
point we cannot decide which tags are less likely, all possible subclasses of all $c_i$ are considered. To generate a
complete set of $r_i$, all $c_i$ are split into all possible subclasses $s_{ij}$ that are assigned a tag list $t(s_{ij})$ containing a
subset of the tags of $t(c_i)$, and an (un-normalized) probability vector $p(s_{ij})$ containing only the relevant elements
of $p(c_i)$. For example, the above $c_1$ (Eq. 2) is split into seven subclasses $s_{1j}$:



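$$
\begin{aligned}
t(s_{1,0}) &= (\text{ADJ}, \text{NOUN}, \text{VERB}) & p(s_{1,0}) &= \begin{bmatrix} 0.29 \\ 0.60 \\ 0.11 \end{bmatrix} \\
t(s_{1,1}) &= (\text{NOUN}, \text{VERB}) & p(s_{1,1}) &= \begin{bmatrix} 0.60 \\ 0.11 \end{bmatrix} \\
t(s_{1,2}) &= (\text{ADJ}, \text{VERB}) & p(s_{1,2}) &= \begin{bmatrix} 0.29 \\ 0.11 \end{bmatrix} \\
&\text{etc.}
\end{aligned}
\tag{3}
$$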
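
The split into subclasses amounts to enumerating all non-empty subsets of the tag list. The following Python sketch illustrates this; the function name `subclasses` is hypothetical, and the order in which the subclasses are produced is arbitrary.

```python
from itertools import combinations

def subclasses(tag_list, prob_vector):
    """Return all non-empty subclasses of an ambiguity class.

    Each subclass is a pair (tags, probs), where probs holds the
    corresponding, un-normalized elements of the original vector.
    """
    indexed = list(zip(tag_list, prob_vector))
    result = []
    for size in range(len(indexed), 0, -1):
        for subset in combinations(indexed, size):
            tags = tuple(tag for tag, _ in subset)
            probs = [prob for _, prob in subset]
            result.append((tags, probs))
    return result

subs = subclasses(("ADJ", "NOUN", "VERB"), [0.29, 0.60, 0.11])
```

A class with $n$ tags yields $2^n - 1$ subclasses, i.e., seven for the three tags of $c_1$.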






Different $c_i$ can produce an $s_{ij}$ with the same tag list $t(s_{ij})$ but with different probability vectors $p(s_{ij})$; e.g.,
the classes with the tag lists (ADJ, NOUN, VERB), (NOUN, VERB), and (ADJ, ADV, NOUN, VERB) can all produce a subclass
with the tag list (NOUN, VERB). To reduce the total number of subclasses, all $s_{ij}$ with the same tag list $t(s_{ij})$
are clustered, based on the centroid method (Romesburg 1989, p. 136), using the vector cosine as the similarity
measure between clusters (Salton and McGill 1983, p. 201). Each final cluster constitutes a reduced ambiguity
class $r_y$. If we obtain, e.g., three $r_y$ with the same tag list $t(r_y)$ = (NOUN, VERB) but with different (re-normalized)
probability vectors:

$$
p(r_1) = \begin{bmatrix} 0.89 \\ 0.11 \end{bmatrix}, \qquad
p(r_2) = \begin{bmatrix} 0.57 \\ 0.43 \end{bmatrix}, \qquad
p(r_3) = \begin{bmatrix} 0.09 \\ 0.91 \end{bmatrix}
\tag{4}
$$

we represent them in an FST by three different symbols, e.g., "[NOUN VERB]_R_1", "[NOUN VERB]_R_2", and
"[NOUN VERB]_R_3".
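
The clustering step can be sketched in Python as follows. This is an illustration only: the stopping criterion (a minimum cosine similarity `min_similarity`), its value, and the normalization of the vectors before clustering are assumptions of the sketch and are not specified above.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def normalize(v):
    """Scale a vector so that its elements sum to 1."""
    total = sum(v)
    return [x / total for x in v]

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cluster(vectors, min_similarity=0.95):
    """Agglomerative centroid clustering with cosine similarity.

    Repeatedly merges the two clusters with the most similar centroids
    until no pair reaches min_similarity (an assumed stopping rule).
    Returns one re-normalized centroid per final cluster.
    """
    clusters = [[normalize(v)] for v in vectors]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        if best[0] < min_similarity:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return [normalize(centroid(c)) for c in clusters]

# Three hypothetical subclass vectors sharing the tag list (NOUN, VERB).
reduced = cluster([[0.60, 0.11], [0.58, 0.12], [0.05, 0.50]])
```

In this toy run the first two vectors point in nearly the same direction and are merged, while the third forms a cluster of its own, so two reduced ambiguity classes result.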



3 Contextual Probabilities 

T1 will map a sequence of $c_i$, from left to right, to a sequence of $r_i$. Therefore, the construction of T1 requires
estimating the most likely $r_i$ in the context of both the current $c_i$ and the previous $r_{i-1}$ (with respect to the current
position $i$ in a sequence). To determine this $r_i$, a probability $p_{T_1}(t_{ij})$ is estimated for every POS tag $t_{ij}$ in $c_i$. In the
initial position, $p_{T_1}(t_{ij})$ depends on the preceding sentence boundary $\#_{i-1}$ and the current $c_i$, which are assumed
to be mutually independent:

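$$
\begin{aligned}
p_{T_1}(t_{ij}) &= p(t_{ij} \mid \#_{i-1}\, c_i) \\
&= \frac{p(t_{ij}\, \#_{i-1}\, c_i)}{p(\#_{i-1}\, c_i)} \\
&= \frac{p(\#_{i-1}\, c_i \mid t_{ij}) \cdot p(t_{ij})}{p(\#_{i-1}\, c_i)} \\
&= \frac{p(\#_{i-1} \mid t_{ij}) \cdot p(c_i \mid t_{ij}) \cdot p(t_{ij})}{p(\#_{i-1}) \cdot p(c_i)} \\
&= \frac{p(t_{ij}\, \#_{i-1}) \cdot p(t_{ij}\, c_i)}{p(t_{ij}) \cdot p(\#_{i-1}) \cdot p(c_i)} \\
&= \frac{p(t_{ij} \mid \#_{i-1}) \cdot p(t_{ij} \mid c_i)}{p(t_{ij})}
\end{aligned}
\tag{5}
$$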



Here, $p(t_{ij} \mid c_i)$ can be extracted from the probability vector $p(c_i)$, and $p(t_{ij} \mid \#_{i-1})$ and $p(t_{ij})$ can be estimated
from the training corpus.

In any position other than the initial one, $p_{T_1}(t_{ij})$ depends on the preceding $r_{i-1}$ and the current $c_i$, which are
assumed to be mutually independent:

$$
p_{T_1}(t_{ij}) = p(t_{ij} \mid r_{i-1}\, c_i) = \frac{p(t_{ij} \mid r_{i-1}) \cdot p(t_{ij} \mid c_i)}{p(t_{ij})}
\tag{6}
$$

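The probability $p(t_{ij} \mid r_{i-1})$ is estimated by:

$$
p(t_{ij} \mid r_{i-1}) = \sum_k p(t_{ij} \mid t_{i-1,k}) \cdot p(t_{i-1,k} \mid r_{i-1})
\qquad \text{with } t_{ij} \in t(c_i),\; t_{i-1,k} \in t(r_{i-1})
\tag{7}
$$

where $p(t_{ij} \mid t_{i-1,k})$ can be estimated from the training corpus, and $p(t_{i-1,k} \mid r_{i-1})$ can be extracted from the
probability vector $p(r_{i-1})$ of the preceding $r_{i-1}$.

To evaluate all tags of the current $c_i$, a list $\mathcal{P}(c_i)$ containing pairs $(t_{ij}, p_{T_1}(t_{ij}))$ of all tags $t_{ij}$ of $c_i$ with
their probabilities $p_{T_1}(t_{ij})$ (Eqs. 5, 6) is created:

$$
\mathcal{P}(c_i) = \left\{ (t_{i1}, p_{T_1}(t_{i1})),\; (t_{i2}, p_{T_1}(t_{i2})),\; \ldots,\; (t_{in}, p_{T_1}(t_{in})) \right\}
\tag{8}
$$

Every tag $t_{ij}$ in $\mathcal{P}$ is compared to the most likely tag $t_{i,m}$ in $\mathcal{P}$. If the ratio of their probabilities is below a
threshold $\tau$, $t_{ij}$ is removed from $\mathcal{P}$:

$$
\frac{p_{T_1}(t_{ij})}{p_{T_1}(t_{i,m})} < \tau
\tag{9}
$$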
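
Eqs. 6 to 9 can be traced with a small Python sketch. All probabilities, the threshold value, and the function names are hypothetical and serve only to show the order of the computations.

```python
def p_tag_given_prev_class(tag, prev_class, p_trans):
    """Eq. 7: sum over the tags t' of r_{i-1} of p(tag | t') * p(t' | r_{i-1})."""
    return sum(p_trans[(prev_tag, tag)] * p_prev for prev_tag, p_prev in prev_class)

def score_tags(cur_class, prev_class, p_trans, p_prior):
    """Eq. 6 for every tag of c_i; the result is the list P(c_i) of Eq. 8."""
    return [
        (tag, p_tag_given_prev_class(tag, prev_class, p_trans) * p_c / p_prior[tag])
        for tag, p_c in cur_class
    ]

def prune(scored, tau):
    """Eq. 9: drop every tag whose ratio to the best score is below tau."""
    best = max(score for _, score in scored)
    return [(tag, score) for tag, score in scored if score / best >= tau]

# Hypothetical values: r_{i-1} = [DET RELPRO], c_i = [ADJ NOUN VERB].
prev_class = [("DET", 0.9), ("RELPRO", 0.1)]
cur_class = [("ADJ", 0.29), ("NOUN", 0.60), ("VERB", 0.11)]
p_trans = {  # p(current tag | previous tag), keyed (previous, current)
    ("DET", "ADJ"): 0.30, ("DET", "NOUN"): 0.60, ("DET", "VERB"): 0.01,
    ("RELPRO", "ADJ"): 0.05, ("RELPRO", "NOUN"): 0.15, ("RELPRO", "VERB"): 0.50,
}
p_prior = {"ADJ": 0.10, "NOUN": 0.30, "VERB": 0.20}

scored = score_tags(cur_class, prev_class, p_trans, p_prior)
kept = prune(scored, tau=0.1)
```

With these toy values VERB is removed, and ADJ and NOUN remain. Note that a value of Eq. 6 can exceed 1, since the independence assumption is only an approximation; in this sketch only the ratios between the values matter.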

Removing less likely tags leads to a reduced list $\mathcal{P}_r(c_i)$ that is then split into a reduced tag list $t_r(c_i)$ and a
reduced probability vector $p_r(c_i)$ that jointly describe a reduced ambiguity class $r_y$. From among all predefined
$r_i$ (cf., e.g., Eq. 4), we select the one that has the same tag list $t(r_i)$ as the "ideal" reduced class $r_y$ and the most
similar probability vector $p(r_i)$ according to the cosine measure. This $r_i$ is considered to be the most likely among
all predefined $r_i$ in the context of both the current $c_i$ and the previous $r_{i-1}$.
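
The selection of the predefined class can be sketched as follows. The predefined classes are those of Eq. 4; the "ideal" probability values and the function name `select_reduced_class` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_reduced_class(ideal, predefined):
    """Pick the predefined class with the same tag list and the closest vector.

    ideal: list of (tag, probability) pairs remaining after pruning.
    predefined: dict mapping a tag list to a list of (symbol, vector) pairs.
    """
    tags = tuple(tag for tag, _ in ideal)
    probs = [prob for _, prob in ideal]
    candidates = predefined[tags]
    symbol, _ = max(candidates, key=lambda cand: cosine(probs, cand[1]))
    return symbol

predefined = {
    ("NOUN", "VERB"): [
        ("[NOUN VERB]_R_1", [0.89, 0.11]),
        ("[NOUN VERB]_R_2", [0.57, 0.43]),
        ("[NOUN VERB]_R_3", [0.09, 0.91]),
    ],
}
ideal = [("NOUN", 0.50), ("VERB", 0.40)]  # hypothetical pruned list
chosen = select_reduced_class(ideal, predefined)
```

Because the cosine measure is invariant to scaling, the "ideal" vector does not need to be re-normalized before the comparison.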



T2 will map a sequence of $r_i$, from right to left, to a sequence of tags $t_i$. Therefore, the construction of T2
requires estimating the most likely $t_i$ in the context of both the current $r_i$ and the following $t_{i+1}$. To determine
this $t_i$, a probability $p_{T_2}(t_{ij})$ is estimated for every tag $t_{ij}$ of the current $r_i$. In the final position, $p_{T_2}(t_{ij})$ depends
on the current $r_i$ and on the following sentence boundary $\#_{i+1}$:



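$$
p_{T_2}(t_{ij}) = p(t_{ij} \mid r_i\, \#_{i+1}) = \frac{p(t_{ij} \mid \#_{i+1}) \cdot p(t_{ij} \mid r_i)}{p(t_{ij})}
\tag{10}
$$

In any position other than the final one, $p_{T_2}(t_{ij})$ depends on the current $r_i$ and the following tag $t_{i+1}$:

$$
p_{T_2}(t_{ij}) = p(t_{ij} \mid r_i\, t_{i+1}) = \frac{p(t_{ij} \mid t_{i+1}) \cdot p(t_{ij} \mid r_i)}{p(t_{ij})}
\tag{11}
$$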



Here, $p(t_{ij})$, $p(t_{ij} \mid t_{i+1})$, and $p(t_{ij} \mid \#_{i+1})$ are estimated from the training corpus, and $p(t_{ij} \mid r_i)$ is extracted
from the probability vector $p(r_i)$.

The $t_i$ with the highest probability $p_{T_2}(t_i)$ is the most likely tag in the context of both the current $r_i$ and the
following $t_{i+1}$ (Eqs. 10, 11).
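
The tag selection for T2 can be sketched in the same way. The probability values and the function name `best_tag` are hypothetical; the sentence boundary can be handled with the same function by passing a boundary symbol as the following context (Eq. 10).

```python
def best_tag(reduced_class, following, p_tag_given_following, p_prior):
    """Eqs. 10 and 11: score every tag of r_i and return the best one.

    reduced_class: list of (tag, p(tag | r_i)) pairs.
    following: the following tag t_{i+1}, or a boundary symbol such as "#".
    p_tag_given_following: dict keyed (tag, following) -> p(tag | following).
    """
    scores = {
        tag: p_tag_given_following[(tag, following)] * p_r / p_prior[tag]
        for tag, p_r in reduced_class
    }
    return max(scores, key=scores.get), scores

# Hypothetical values: r_i = [NOUN VERB]_R_2, following tag VERB.
reduced_class = [("NOUN", 0.57), ("VERB", 0.43)]
p_tag_given_following = {("NOUN", "VERB"): 0.40, ("VERB", "VERB"): 0.10}
p_prior = {"NOUN": 0.30, "VERB": 0.20}

tag, scores = best_tag(reduced_class, "VERB", p_tag_given_following, p_prior)
```

With these toy values NOUN obtains the higher score and is selected.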



4 Construction of the FSTs 

The construction of T1 is preceded by defining all $c_i$ and $r_i$, and estimating their contextual probabilities. In this
process, all words in the training corpus, which are initially annotated with POS tags, are in addition annotated
with ambiguity classes $c_i$.

In T1, one state is created for every $r_i$ (output symbol), and is labeled with this $r_i$ (Fig. 2). An initial state,
not corresponding to any $r_i$, is created in addition. From every state, one outgoing arc is created for every $c_i$
(input symbol), and is labeled with this $c_i$. The destination of every arc is the state of the most likely $r_i$ in
the context of both the current $c_i$ (arc label) and the preceding $r_{i-1}$ (source state label), which is estimated as
described above. All arc labels are then changed from simple symbols $c_i$ to symbol pairs $c_i{:}r_i$ (mapping $c_i$ to $r_i$)
that consist of the original arc label and the destination state label. All state labels are removed (Fig. 2). Those
$r_i$ that are unlikely in any context disappear from T1 because the corresponding states have no incoming arcs.
T1 accepts any sequence of $c_i$ and maps it, from left to right, to the sequence of the most likely $r_i$ in the given
left context.
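
The construction of T1 can be sketched in Python with a plain dictionary of arcs. The sketch deviates slightly from the description above: instead of creating a state for every $r_i$ and then dropping states without incoming arcs, it creates only the states reachable from the initial state. The function `toy_choice` is a hypothetical stand-in for the estimation of the most likely $r_i$ (Section 3).

```python
def build_t1(ambiguity_classes, choose_reduced_class, initial="<init>"):
    """Return the arcs of T1 as a dict {(source_state, c): r}.

    Every state is named after the reduced class r that leads to it, so r
    is both the output symbol and the destination state of an arc.
    """
    arcs = {}
    pending, seen = [initial], {initial}
    while pending:
        state = pending.pop()
        for c in ambiguity_classes:
            r = choose_reduced_class(state, c)
            arcs[(state, c)] = r
            if r not in seen:  # create the destination state on first use
                seen.add(r)
                pending.append(r)
    return arcs

def toy_choice(prev_state, c):
    """Hypothetical decision rule reproducing the example of Fig. 1."""
    if c == "[ADJ NOUN]":
        return "[ADJ]" if prev_state == "[DET RELPRO]" else "[ADJ NOUN]"
    if c == "[ADJ NOUN VERB]":
        return "[ADJ NOUN]" if prev_state == "[ADJ]" else "[ADJ NOUN VERB]"
    return c

classes = ["[DET RELPRO]", "[ADJ NOUN]", "[ADJ NOUN VERB]", "[VERB]"]
t1_arcs = build_t1(classes, toy_choice)
```

Since every state has exactly one outgoing arc per input symbol, the resulting transducer is sequential and accepts any sequence of the given classes.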

The construction of T2 is preceded by annotating the training corpus, in addition, with reduced ambiguity classes
$r_i$ by means of T1. The probability vectors $p(r_i)$ of all $r_i$ are then re-estimated. The contextual probabilities of
tags are estimated only at this point (Eqs. 10, 11).

Figure 2: Two stages in the construction of T1

In T2, one state is created for every $t_i$ (output symbol), and is labeled with this $t_i$ (Fig. 3). An initial state
is added. From every state, one outgoing arc is created for every $r_i$ (input symbol) that occurs in the output
language of T1, and is labeled with this $r_i$. The destination of every arc is the state of the most likely $t_i$ in
the context of both the current $r_i$ (arc label) and the following $t_{i+1}$ (source state label), which is estimated as
described above. Note that this is the following tag, rather than the preceding one, because T2 will be applied from right
to left. All arc labels are then changed into symbol pairs $r_i{:}t_i$ and all state labels are removed (Fig. 3), as was
done in T1. T2 accepts any sequence of $r_i$ generated by T1, and maps it, from right to left, to the sequence of the
most likely $t_i$ in the given right context.




Figure 3: Two stages in the construction of T2 



Both T1 and T2 are sequential. They can be minimized with standard algorithms. Once T1 and T2 are built,
the probabilities of all $t_i$, $r_i$, and $c_i$ are of no further use. Probabilities do not explicitly occur in the FSTs, and
are not directly used at run time. They are, however, "reflected" by the structure of the FSTs.

5 Application of the FSTs 

Our FST tagger uses the T1 and T2 described above, a class-based lexicon, and possibly a guesser to predict the
ambiguity classes of unknown words (possibly based on their suffixes). The lexicon and guesser are also sequential
FSTs, and map any word that they accept to a single symbol $c_i$ representing an ambiguity class (Fig. 1). If a
word cannot be found in the lexicon, it is analyzed by the guesser. If this does not provide an analysis either,
the word is assigned a special $c_i$ for unknown words that is estimated from the $m$ most frequent tags of all words
that occur only once in the training corpus.
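
The fallback order of lexicon, guesser, and unknown-word class can be sketched as follows. For brevity, the sketch uses a dictionary and a function in place of the sequential FSTs; the suffix rules and the symbol "[UNKNOWN]" are hypothetical.

```python
def suffix_guesser(word):
    """Hypothetical guesser: predict an ambiguity class from the suffix."""
    for suffix, c in (("ing", "[NOUN VERB]"), ("ly", "[ADV]")):
        if word.endswith(suffix):
            return c
    return None  # the guesser provides no analysis

def ambiguity_class(word, lexicon, guesser, unknown_class="[UNKNOWN]"):
    """Look up a word in the lexicon, then the guesser, then fall back."""
    c = lexicon.get(word)
    if c is None:
        c = guesser(word)
    return c if c is not None else unknown_class

lexicon = {"the": "[DET RELPRO]", "light": "[ADJ NOUN VERB]"}
looked_up = [ambiguity_class(w, lexicon, suffix_guesser)
             for w in ("the", "blinking", "xyz")]
```

Here "the" is found in the lexicon, "blinking" is analyzed by the guesser, and "xyz" receives the class for unknown words.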

The sequence of the $c_i$ of all words of one sentence is the input to our FST cascade (Fig. 1). It is mapped by
T1, from left to right, to a sequence of reduced ambiguity classes $r_i$. Intuitively, T1 eliminates the less likely tags
from $c_i$, thus creating $r_i$. Finally, T2 maps the sequence of $r_i$, from right to left, to an output sequence of single
POS tags $t_i$. Intuitively, T2 selects the most likely $t_i$ from every $r_i$.
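
The application of the cascade to the example of Fig. 1 can be sketched as two table lookups per word. The arc tables below are hypothetical and contain only the arcs needed for this one sentence; a real T1 or T2 has an arc for every input symbol in every state.

```python
def run(arcs, symbols, initial="<init>"):
    """Apply a sequential transducer given as {(state, input): output}.

    The destination state of an arc is named after its output symbol.
    """
    state, output = initial, []
    for symbol in symbols:
        state = arcs[(state, symbol)]
        output.append(state)
    return output

def tag(classes, t1, t2):
    """Map ambiguity classes to tags: T1 left to right, T2 right to left."""
    reduced = run(t1, classes)
    tags_reversed = run(t2, reversed(reduced))
    return tags_reversed[::-1]

t1 = {
    ("<init>", "[DET RELPRO]"): "[DET RELPRO]",
    ("[DET RELPRO]", "[ADJ NOUN]"): "[ADJ]",
    ("[ADJ]", "[ADJ NOUN VERB]"): "[ADJ NOUN]",
    ("[ADJ NOUN]", "[VERB]"): "[VERB]",
}
t2 = {  # the source state is the FOLLOWING tag, since T2 runs right to left
    ("<init>", "[VERB]"): "VERB",
    ("VERB", "[ADJ NOUN]"): "NOUN",
    ("NOUN", "[ADJ]"): "ADJ",
    ("ADJ", "[DET RELPRO]"): "DET",
}
sentence = ["[DET RELPRO]", "[ADJ NOUN]", "[ADJ NOUN VERB]", "[VERB]"]
tags = tag(sentence, t1, t2)
```

No probabilities are computed at run time; each word costs one arc lookup in T1 and one in T2.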

6 Results 

We compared our FST tagger on English, German, and Spanish with a commercially available (foreign) HMM
tagger (Table 1). The comparison was made on the same non-overlapping training and test corpora for both taggers
(Table 3). The FST tagger was on average 10 times as fast but slightly less accurate than the HMM tagger (45 600
words/sec and 96.97% versus 4 360 words/sec and 97.43%). In some applications such as Information Retrieval,
a significant speed increase could be worth the small loss in accuracy.

                             English    German   Spanish   Average
Speed (words/sec)   T1+T2     47 600    42 200    46 900    45 600
                    HMM        4 110     3 620     5 360     4 360
Accuracy (%)        T1+T2      96.54     96.79     97.05     96.97
                    HMM        96.80     97.55     97.95     97.43

Computer: SUN Workstation, Ultra2, with 1 CPU
Table 1: Processing speed and accuracy of the FST and the HMM taggers





                              English    German   Spanish   Average
# States                          615       496       353       488
# Arcs                        209 000   197 000    96 000   167 000
# Tags                             76        67        56        66
# Ambiguity classes               349       448       265       354
# Reduced ambiguity classes       724       732       465       640

Table 2: Sizes of the FST cascades and their alphabets





                                English    German   Spanish   Average
Training corpus size (words)     20 000    91 000    16 000    42 000
Test corpus size (words)         20 000    40 000    15 000    25 000

Table 3: Sizes of the training and test corpora